
Welcome to Boston, Massachusetts, in the 1970s! Imagine you're working for a real estate development company. Your company wants to value any residential project before it starts. You are tasked with building a model that can provide a price estimate based on a home's characteristics, such as the number of rooms, the distance to employment centres, and the level of pollution in the area.

To accomplish your task you will analyse and explore the Boston house price data, split the data for training and testing, run a multivariable regression, and evaluate how well your model performs.
Google Colab may not be running the latest version of plotly. If you're working in Google Colab, uncomment the line below, run the cell, and restart your notebook server.
# %pip install --upgrade plotly
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
pd.options.display.float_format = '{:,.2f}'.format
The first column in the .csv file just has the row numbers, so it will be used as the index.
data = pd.read_csv('boston.csv', index_col=0)
Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. The Median Value (attribute 14) is the target.
:Attribute Information (in order):
1. CRIM per capita crime rate by town
2. ZN proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS proportion of non-retail business acres per town
4. CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX nitric oxides concentration (parts per 10 million)
6. RM average number of rooms per dwelling
7. AGE proportion of owner-occupied units built prior to 1940
8. DIS weighted distances to five Boston employment centres
9. RAD index of accessibility to radial highways
10. TAX full-value property-tax rate per $10,000
11. PTRATIO pupil-teacher ratio by town
12. B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT % lower status of the population
14. PRICE Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset. This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. You can find the original research paper here.
Challenge
What does the data look like? Display the DataFrame to find out.
data
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.01 | 18.00 | 2.31 | 0.00 | 0.54 | 6.58 | 65.20 | 4.09 | 1.00 | 296.00 | 15.30 | 396.90 | 4.98 | 24.00 |
| 1 | 0.03 | 0.00 | 7.07 | 0.00 | 0.47 | 6.42 | 78.90 | 4.97 | 2.00 | 242.00 | 17.80 | 396.90 | 9.14 | 21.60 |
| 2 | 0.03 | 0.00 | 7.07 | 0.00 | 0.47 | 7.18 | 61.10 | 4.97 | 2.00 | 242.00 | 17.80 | 392.83 | 4.03 | 34.70 |
| 3 | 0.03 | 0.00 | 2.18 | 0.00 | 0.46 | 7.00 | 45.80 | 6.06 | 3.00 | 222.00 | 18.70 | 394.63 | 2.94 | 33.40 |
| 4 | 0.07 | 0.00 | 2.18 | 0.00 | 0.46 | 7.15 | 54.20 | 6.06 | 3.00 | 222.00 | 18.70 | 396.90 | 5.33 | 36.20 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06 | 0.00 | 11.93 | 0.00 | 0.57 | 6.59 | 69.10 | 2.48 | 1.00 | 273.00 | 21.00 | 391.99 | 9.67 | 22.40 |
| 502 | 0.05 | 0.00 | 11.93 | 0.00 | 0.57 | 6.12 | 76.70 | 2.29 | 1.00 | 273.00 | 21.00 | 396.90 | 9.08 | 20.60 |
| 503 | 0.06 | 0.00 | 11.93 | 0.00 | 0.57 | 6.98 | 91.00 | 2.17 | 1.00 | 273.00 | 21.00 | 396.90 | 5.64 | 23.90 |
| 504 | 0.11 | 0.00 | 11.93 | 0.00 | 0.57 | 6.79 | 89.30 | 2.39 | 1.00 | 273.00 | 21.00 | 393.45 | 6.48 | 22.00 |
| 505 | 0.05 | 0.00 | 11.93 | 0.00 | 0.57 | 6.03 | 80.80 | 2.50 | 1.00 | 273.00 | 21.00 | 396.90 | 7.88 | 11.90 |
506 rows × 14 columns
data.dtypes
CRIM       float64
ZN         float64
INDUS      float64
CHAS       float64
NOX        float64
RM         float64
AGE        float64
DIS        float64
RAD        float64
TAX        float64
PTRATIO    float64
B          float64
LSTAT      float64
dtype: object
display(data.shape)
(506, 14)
print(f"The data has {data.shape[0]} rows and {data.shape[1]} columns.")
The data has 506 rows and 14 columns.
To get a list of the column names:
data.columns
Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT', 'PRICE'],
dtype='object')
data.isna().any().any()
False
data.duplicated().any()
False
--> No NaN and no duplicated values.
Challenge
What is the CHAS feature? What are the minimum and maximum values of CHAS, and why?
data_stats = data.describe()
data_stats
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 |
| mean | 3.61 | 11.36 | 11.14 | 0.07 | 0.55 | 6.28 | 68.57 | 3.80 | 9.55 | 408.24 | 18.46 | 356.67 | 12.65 | 22.53 |
| std | 8.60 | 23.32 | 6.86 | 0.25 | 0.12 | 0.70 | 28.15 | 2.11 | 8.71 | 168.54 | 2.16 | 91.29 | 7.14 | 9.20 |
| min | 0.01 | 0.00 | 0.46 | 0.00 | 0.39 | 3.56 | 2.90 | 1.13 | 1.00 | 187.00 | 12.60 | 0.32 | 1.73 | 5.00 |
| 25% | 0.08 | 0.00 | 5.19 | 0.00 | 0.45 | 5.89 | 45.02 | 2.10 | 4.00 | 279.00 | 17.40 | 375.38 | 6.95 | 17.02 |
| 50% | 0.26 | 0.00 | 9.69 | 0.00 | 0.54 | 6.21 | 77.50 | 3.21 | 5.00 | 330.00 | 19.05 | 391.44 | 11.36 | 21.20 |
| 75% | 3.68 | 12.50 | 18.10 | 0.00 | 0.62 | 6.62 | 94.07 | 5.19 | 24.00 | 666.00 | 20.20 | 396.23 | 16.96 | 25.00 |
| max | 88.98 | 100.00 | 27.74 | 1.00 | 0.87 | 8.78 | 100.00 | 12.13 | 24.00 | 711.00 | 22.00 | 396.90 | 37.97 | 50.00 |
print(f"There are {data_stats.loc['mean'].PTRATIO:.0f} students "\
"per teacher on average.")
There are 18 students per teacher on average.
print("The average price of a home in the dataset is "\
f"${data_stats.loc['mean'].PRICE * 1000:,.0f}.")
The average price of a home in the dataset is $22,533.
We've experienced a lot of inflation and house price appreciation since then!
CHAS feature: "CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)"
"Tract housing is a type of housing development in which multiple similar houses are built on a tract (area) of land that is subdivided into smaller lots."
So basically CHAS has value 1 if the home land area borders the river.
As such, it only has the value 0 or 1. This kind of feature is also known as a dummy variable.
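Here's a minimal sketch of how such a dummy variable is built (using a made-up `bounds_river` column, not the Boston data):

```python
import pandas as pd

# Hypothetical mini-dataset: does the lot bound the river?
df = pd.DataFrame({'bounds_river': ['yes', 'no', 'no', 'yes']})

# The dummy variable encodes the category as 1 (yes) or 0 (no)
df['CHAS'] = (df['bounds_river'] == 'yes').astype(int)

print(df['CHAS'].tolist())  # [1, 0, 0, 1]
```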
print("In the dataset, the maximum avg number of rooms "\
f"per dwelling is {data_stats.loc['max'].RM:.2f} "\
f"and the minimum is {data_stats.loc['min'].RM:.2f}.")
In the dataset, the maximum avg number of rooms per dwelling is 8.78 and the minimum is 3.56.
Challenge: Having looked at some descriptive statistics, visualise the data for your model. Use Seaborn's .displot() to create a histogram and superimpose the Kernel Density Estimate (KDE) for the PRICE, DIS, RM and RAD variables.
Try setting the aspect parameter to 2 for a better picture.
What do you notice in the distributions of the data?
The size of the graph must be managed with the height= and aspect= parameters, as seen previously.
height: Height (in inches) of each facet.
aspect: Aspect ratio of each facet, so that aspect * height gives the width of each facet in inches.
plt.figure(dpi=200)
with sns.axes_style('darkgrid'):
dis = sns.displot(data=data, x='PRICE',
height=4, aspect=2,
kde=True, color='darkorange',
bins=50)
dis.set(xlabel='Price in $ thousands')
plt.title(f'1970s Home Values in Boston. '\
f'Average: ${(1000*data.PRICE.mean()):,.0f}')
plt.show()
Remark: We can note that there is a spike in the number of homes at the very right tail at the $50,000 mark.
plt.figure(dpi=200)
with sns.axes_style('darkgrid'):
dis = sns.displot(data=data, x='DIS',
height=4, aspect=2,
kde=True, color='green',
bins=25)
dis.set(xlabel='Weighted Distance to 5 Boston Employment Centres')
plt.title(
f'Distance to Employment Centres. Average: {(data.DIS.mean()):.2}')
plt.show()
Remark: Most homes are about 3.8 miles away from work. There are fewer and fewer homes the further out we go.
plt.figure(dpi=200)
with sns.axes_style('darkgrid'):
dis = sns.displot(data=data, x='RM',
height=4, aspect=2,
kde=True, color='firebrick'
)
dis.set(xlabel='Average Number of Rooms per Dwelling')
plt.title(
f'Distribution of Rooms in Boston. Average: {data.RM.mean():.2}')
plt.show()
Note: We can change the width of the bins with binwidth=0.5 for example.
plt.figure(dpi=200)
with sns.axes_style('darkgrid'):
dis = sns.displot(data=data, x='RAD',
height=4, aspect=2,
kde=True, color='darkblue',
bins=25,
ec='black') # ec colors the border of the bin
dis.set(xlabel='Index of Accessibility to Radial Highways')
plt.show()
Remark: RAD is an index of accessibility to roads. Better access to a highway is represented by a higher number. There's a big gap in the values of the index.
Challenge
Create a bar chart with plotly for CHAS to show how many more homes are away from the river versus next to it.
You can make your life easier by providing a list of values for the x-axis (e.g., x=['No', 'Yes']).
Regarding the display of NumPy arrays:
By default, an array is summarized/truncated once it has more than 1,000 elements. We can change the display options with np.set_printoptions(): threshold is the number of elements above which an array is summarized, and edgeitems is the number of items shown at the beginning and end of the summary.
I can also set the display options temporarily with the np.printoptions() context manager:
yes_no_river = np.where(data.CHAS == 1, 'Yes', 'No')
len(yes_no_river)
506
# np.set_printoptions(threshold=500, edgeitems=10)
with np.printoptions(threshold=500, edgeitems=10):
display(yes_no_river)
array(['No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', ...,
'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No'],
dtype='<U3')
fig = px.histogram(x=yes_no_river, histfunc='count',
color=yes_no_river,
labels={'x': 'Property Located Next to the River?'},
title='Next to Charles River?',
# color_discrete_sequence=px.colors.sequential.haline
)
fig.update_layout(showlegend=False, yaxis_title='Count')
fig.show()
Course Solution:
river_access = data.CHAS.value_counts()
river_access
0.00    471
1.00     35
Name: CHAS, dtype: int64
fig = px.bar(x=['No', 'Yes'], y=river_access.values,
color=river_access.values,
labels={'x': 'Property Located Next to the River?',
'y': 'Count'},
title='Next to Charles River?',
color_continuous_scale='haline')
fig.update_layout(coloraxis_showscale=False)
fig.show()
We see that out of the total number of 506 homes, only 35 are located next to the Charles River.

Challenge
There might be some relationships in the data that we should know about. Before you run the code, make some predictions about how the different features relate to each other and to PRICE.
Run a Seaborn .pairplot() to visualise all the relationships at the same time. Note, this is a big task and can take 1-2 minutes! After it's finished, check your predictions against the pairplot.
data[['NOX', 'DIS', 'RM', 'PRICE', 'LSTAT']]
| NOX | DIS | RM | PRICE | LSTAT | |
|---|---|---|---|---|---|
| 0 | 0.54 | 4.09 | 6.58 | 24.00 | 4.98 |
| 1 | 0.47 | 4.97 | 6.42 | 21.60 | 9.14 |
| 2 | 0.47 | 4.97 | 7.18 | 34.70 | 4.03 |
| 3 | 0.46 | 6.06 | 7.00 | 33.40 | 2.94 |
| 4 | 0.46 | 6.06 | 7.15 | 36.20 | 5.33 |
| ... | ... | ... | ... | ... | ... |
| 501 | 0.57 | 2.48 | 6.59 | 22.40 | 9.67 |
| 502 | 0.57 | 2.29 | 6.12 | 20.60 | 9.08 |
| 503 | 0.57 | 2.17 | 6.98 | 23.90 | 5.64 |
| 504 | 0.57 | 2.39 | 6.79 | 22.00 | 6.48 |
| 505 | 0.57 | 2.50 | 6.03 | 11.90 | 7.88 |
506 rows × 5 columns
sns.pairplot(data[['NOX', 'DIS', 'RM', 'PRICE', 'LSTAT']],
kind='reg', # to include a regression line
plot_kws={'line_kws':{'color': 'cyan'},
'lowess': True})
plt.show()
We can also do the pairplot on the whole dataset, but it makes a huge plot that is very difficult to read: sns.pairplot(data)
We can clearly see some trends in these relationships.
Challenge
Use Seaborn's .jointplot() to look at some of the relationships in more detail.
Try adding some opacity or alpha to the scatter plots using keyword arguments under joint_kws.
Challenge:
Compare DIS (Distance from employment) with NOX (Nitric Oxide Pollution) using Seaborn's .jointplot(). Does pollution go up or down as the distance increases?
g = sns.jointplot(data[['DIS', 'NOX']],
x='DIS', y='NOX',
kind='reg', ylim=(0.35, 0.9),
joint_kws={'scatter_kws': {'edgecolor': 'white',
'alpha': 0.5},
'line_kws': {'color': 'orange'},
'lowess': True})
plt.show()
We see that pollution goes down as we go further and further out of town. This makes intuitive sense. However, even at the same distance of 2 miles to employment centres, we can get very different levels of pollution. By the same token, DIS of 9 miles and 12 miles have very similar levels of pollution.
Challenge:
Compare INDUS (the proportion of non-retail industry i.e., factories) with NOX (Nitric Oxide Pollution) using Seaborn's .jointplot(). Does pollution go up or down as there is a higher proportion of industry?
g = sns.jointplot(data[['INDUS', 'NOX']],
x='INDUS', y='NOX',
kind='reg', color='indigo',
joint_kws={'scatter_kws': {'edgecolor': 'white',
'alpha': 0.3},
'line_kws': {'color': 'darkred'}
},
)
plt.show()
Challenge
Compare LSTAT (proportion of lower-income population) with RM (number of rooms) using Seaborn's .jointplot(). How does the number of rooms per dwelling vary with the poverty of the area? Do homes have more or fewer rooms when LSTAT is low?
g = sns.jointplot(data[['LSTAT', 'RM']],
x='LSTAT', y='RM',
kind='reg', color='orangered',
joint_kws={'scatter_kws': {'edgecolor': 'white',
'alpha': 0.7},
'line_kws': {'color': 'dimgrey'},
}
)
plt.show()
In the top left corner we see that, for all the homes with 8 or more rooms, LSTAT is well below 10%.
Challenge
Compare LSTAT with PRICE using Seaborn's .jointplot(). How does the proportion of the lower-income population in an area affect home prices?
g = sns.jointplot(data[['LSTAT', 'PRICE']],
x='LSTAT', y='PRICE',
kind='scatter', color='darkgreen',
joint_kws={'edgecolor': 'white',
'alpha': 0.7},
)
plt.show()
Challenge
Compare RM (number of rooms) with PRICE using Seaborn's .jointplot(). You can probably guess how the number of rooms affects home prices. 😊
g = sns.jointplot(data, x='RM', y='PRICE',
kind='reg', ylim=(3, 53),
color='maroon',
joint_kws={'scatter_kws': {'edgecolor': 'white',
'alpha': 0.8},
'line_kws': {'color': 'darkgoldenrod'},
'lowess': True
}
)
plt.show()
Again, we see those homes at the $50,000 mark all lined up at the top of the chart. Perhaps there was some sort of cap or maximum value imposed during data collection.
We can't use all 506 entries in our dataset to train our model. The reason is that we want to evaluate our model on data that it hasn't seen yet (i.e., out-of-sample data). That way we can get a better idea of its performance in the real world.
In other words, we don't want to train our model with all 506 entries, and then take it out in the world already. What we want to do instead is train the model with, say, 80% of the data, and then test it on the remaining 20%. That way, we can already test the model with real data that the model hasn't seen yet.
Challenge
Use the train_test_split() function from sklearn (it has good documentation!) with random_state=10. This helps us get the same results every time and avoid confusion while we're learning ("random_state: Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.").
Hint: Remember, your target is your home PRICE, and your features are all the other columns you'll use to predict the price.
from sklearn.model_selection import train_test_split
X = data.drop('PRICE', axis='columns')
y = data.PRICE
X_train, X_test, y_train, y_test = train_test_split(
X, y, train_size=0.8, random_state=10)
In a previous lesson, we had a linear model with only a single feature (our movie budgets):
$$ \widehat{REVENUE} = \theta_0 + \theta_1 \cdot BUDGET $$
And we did the following:
regression = LinearRegression()
# Explanatory Variable(s) or Feature(s)
X = new_films.USD_Production_Budget.to_frame()
# Response Variable or Target
y = new_films.USD_Worldwide_Gross.to_frame()
regression.fit(X, y)
# Theta zero
regression.intercept_
# Theta one
regression.coef_
# R-squared -> how well our model fits our data
# Anything over 50% is a very decent model
regression.score(X, y)
# Running the model
budget = 350000000
revenue_estimate = regression.intercept_[0] + regression.coef_[0,0]*budget
This time we have a total of 13 features. Therefore, our Linear Regression model will have the following form:
$$ \widehat{PRICE} = \theta_0 + \theta_1 RM + \theta_2 NOX + \theta_3 DIS + \theta_4 CHAS + ... + \theta_{13} LSTAT $$
So I should have 13 coefficients ($\theta_1$ to $\theta_{13}$) and one intercept ($\theta_0$).
Challenge
Use sklearn to run the regression on the training dataset. How high is the r-squared for the regression on the training data?
regression = LinearRegression()
regression.fit(X_train, y_train)
LinearRegression()
regression.intercept_
36.53305138282448
regression.coef_
array([-1.28180656e-01, 6.31981786e-02, -7.57627602e-03, 1.97451452e+00,
-1.62719890e+01, 3.10845625e+00, 1.62922153e-02, -1.48301360e+00,
3.03988206e-01, -1.20820710e-02, -8.20305699e-01, 1.14189890e-02,
-5.81626431e-01])
rsquare = regression.score(X_train, y_train)
rsquare
0.750121534530608
Wow, R-squared at 75%! That is amazing!
regression.score(X_test, y_test)
0.6709339839115642
And 67% on the test data, this is still very good!
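As a sanity check on what .score() returns: it's the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$. A minimal sketch on synthetic data (not the Boston set, with made-up demo variable names) confirming that the hand calculation agrees:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic demo data: 3 features with known coefficients plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

reg_demo = LinearRegression().fit(X_demo, y_demo)

# R-squared computed by hand from the residual and total sums of squares
ss_res = np.sum((y_demo - reg_demo.predict(X_demo)) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(r2_manual, reg_demo.score(X_demo, y_demo)))  # True
```

The same identity holds for the Boston regression: .score(X_train, y_train) is exactly 1 minus the ratio of the residual to the total sum of squares.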
Here we do a sense check on our regression coefficients. The first thing to look for is whether the coefficients have the expected sign (positive or negative).
Challenge Print out the coefficients (the thetas in the equation above) for the features. Hint: You'll see a nice table if you stick the coefficients in a DataFrame.
data.iloc[:, :13].columns
Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT'],
dtype='object')
coeffs = pd.DataFrame(regression.coef_,
index=data.iloc[:, :13].columns,
columns=['Coef']).T
coeffs
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Coef | -0.13 | 0.06 | -0.01 | 1.97 | -16.27 | 3.11 | 0.02 | -1.48 | 0.30 | -0.01 | -0.82 | 0.01 | -0.58 |
According to our previous relationships:
print(f"According to the model, you'd have to "\
f"pay about ${coeffs.RM.iloc[0]*1000:.0f} more for an extra room.")
According to the model, you'd have to pay about $3108 more for an extra room.
The next step is to evaluate our regression. How good our regression is depends not only on the r-squared. It also depends on the residuals - the difference between the model's predictions ($\hat y_i$) and the true values ($y_i$) inside y_train.
predicted_values = regression.predict(X_train)
print(f"Length of the predicted values array: {len(predicted_values)}\n")
with np.printoptions(threshold=100, edgeitems=5):
display(predicted_values)
Length of the predicted values array: 404
array([21.02958601, 12.21844467, 13.74785342, 20.7351517 , 23.41262356,
..., 24.84599397, 19.89193201, 19.16746008, 22.40634226,
28.57061195])
residuals = (y_train - predicted_values)
print(f"Length of the residuals array: {len(residuals)}\n")
residuals
Length of the residuals array: 404
50 -1.33
367 10.88
34 -0.25
78 0.46
172 -0.31
...
320 -1.05
15 0.01
484 1.43
125 -1.01
265 -5.77
Name: PRICE, Length: 404, dtype: float64
Challenge: Create two scatter plots.
The first plot should be actual values (y_train) against the predicted values.
plt.figure(figsize=(6,4), dpi=120)
with sns.axes_style('ticks'):
g = sns.scatterplot(x=y_train, y=predicted_values,
c='indigo', alpha=0.6)
sns.lineplot(x=y_train, y=y_train, c='cyan')
g.set(title=f'Actual vs Predicted Prices: $y _i$ vs $\hat y_i$',
xlabel=f'Actual prices in \$ thousands $y _i$',
ylabel=f'Predicted prices in \$ thousands $\hat y _i$')
plt.show()
The cyan line in the middle shows y_train against y_train. If the predictions had been 100% accurate then all the dots would be on this line. The further away the dots are from the line, the worse the prediction was. That makes the distance to the cyan line, you guessed it, our residuals 😊
The second plot should be the residuals against the predicted prices.
plt.figure(figsize=(6,4), dpi=120)
with sns.axes_style('ticks'):
g = sns.scatterplot(x=predicted_values, y=residuals,
c='indigo', alpha=0.6)
g.set(title=f'Residuals vs Predicted Values',
xlabel='Predicted prices in \$ thousands $\hat y _i$',
ylabel='Residuals')
plt.show()
Why do we want to look at the residuals? We want to check that they look random. Why? The residuals represent the errors of our model. If there's a pattern in our errors, then our model has a systematic bias.
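To see what a "pattern in the errors" looks like, here's a small synthetic sketch (not the Boston data): fitting a straight line to genuinely quadratic data leaves residuals that are positive at the extremes and negative in the middle, a systematic bias that a straight line can never remove:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
xs = np.linspace(0, 10, 300)
ys = xs ** 2 + rng.normal(scale=2, size=xs.size)  # truly quadratic relationship

# Fit a (mis-specified) straight line and compute its residuals
line = LinearRegression().fit(xs.reshape(-1, 1), ys)
resid = ys - line.predict(xs.reshape(-1, 1))

# Systematic U-shape: positive at both ends, negative in the middle
print(resid[:50].mean() > 0, resid[125:175].mean() < 0, resid[-50:].mean() > 0)
```

If our Boston model's residuals showed a shape like this when plotted against the predictions, it would be a sign that the model is missing something systematic.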
We can analyse the distribution of the residuals. In particular, we're interested in the skew and the mean.
In an ideal case, what we want is something close to a normal distribution. A normal distribution has a skewness of 0 and a mean of 0. A skew of 0 means that the distribution is symmetrical - the bell curve is not lopsided or biased to one side. Here's what a normal distribution looks like:

Challenge
Use .displot() to create a histogram of the residuals and superimpose the Kernel Density Estimate (KDE). Also calculate the mean and the skewness of the residuals.
print(f"The mean of our residuals is {residuals.mean():.20f}")
display(residuals.mean())
resid_mean = round(residuals.mean(), 2)
The mean of our residuals is -0.00000000000000367583
-3.675827519154974e-15
Damn! That's as close to zero as you can get!
print(f"The skewness of our residuals is {residuals.skew():.4f}")
resid_skew = round(residuals.skew(), 2)
The skewness of our residuals is 1.4594
That would mean our residuals distribution is not really symmetrical. Let's see:
plt.figure(dpi=120)
with sns.axes_style('ticks'):
dis = sns.displot(x=residuals, bins=50,
kde=True, height=6,
color='indigo')
dis.set(xlim=(-25, 25),
title=f'Residuals Skew ({resid_skew}) Mean ({resid_mean})')
plt.show()
We see that the residuals have a skewness of 1.46. There could be some room for improvement here.
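For intuition on what a skew of this size implies, here's a quick synthetic comparison (not the Boston data): a symmetric normal sample has skew near 0, while a long right tail (an exponential sample) pushes the skew well above 0:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
symmetric = pd.Series(rng.normal(size=100_000))        # bell curve
right_tail = pd.Series(rng.exponential(size=100_000))  # long right tail

print(abs(symmetric.skew()) < 0.05)  # True: roughly symmetrical
print(right_tail.skew() > 1)         # True: strongly right-skewed
```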
We have two options at this point:
Let's try a data transformation approach.
Challenge
Investigate if the target data['PRICE'] could be a suitable candidate for a log transformation.
Use Seaborn's .displot() to show a histogram and KDE of the price data. Then use NumPy's log() function to create a Series that has the log prices, plot it with .displot(), and calculate its skew.
data.PRICE.mean()
22.532806324110677
We can set the xlim to about twice the mean (45) to better see whether the distribution is symmetrical.
print(f"The skewness of the target PRICE is: "\
f"{data.PRICE.skew():.3f}")
tgt_skew = data['PRICE'].skew()
The skewness of the target PRICE is: 1.108
plt.figure(dpi=120)
with sns.axes_style('darkgrid'):
dis = sns.displot(x=data.PRICE, bins=50,
kde=True, height=5.5,
color='green')
dis.set(title=f'Data PRICE Distribution. Skew is {tgt_skew:.3}',
xlim=(0, 45))
plt.show()
log_prices = np.log(data.PRICE)
log_prices
0 3.18
1 3.07
2 3.55
3 3.51
4 3.59
...
501 3.11
502 3.03
503 3.17
504 3.09
505 2.48
Name: PRICE, Length: 506, dtype: float64
log_skew = log_prices.skew()
log_skew
-0.33032129530987864
log_prices.mean()
3.0345128744146552
plt.figure(dpi=120)
with sns.axes_style('darkgrid'):
dis = sns.displot(x=log_prices, bins=50,
kde=True, height=5.5,
color='green')
dis.set(title=f'Log() Data PRICE Distribution. Skew is {log_skew:.3}',
# xlim=(0, 6)
xlim=(1.5,4.5)
)
plt.show()
The target (log(PRICE)) is actually more symmetrical now.
The log prices have a skew that's closer to zero. This makes them a good candidate for use in our linear model. Perhaps using log prices will improve our regression's r-squared and our model's residuals.
Using a log transformation does not affect every price equally. Large prices are affected more than smaller prices in the dataset. Here's how the prices are "compressed" by the log transformation:

We can see this when we plot the actual prices against the (transformed) log prices.
plt.figure(dpi=150)
plt.scatter(data.PRICE, np.log(data.PRICE))
plt.title('Mapping the Original Price to a Log Price')
plt.ylabel('Log Price')
plt.xlabel('Actual Price in $ thousands')
plt.show()
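The compression can also be checked numerically. A tiny sketch with made-up prices (in $ thousands): equal ratios become equal differences after the log transform, so a doubling always adds log(2) to the log price no matter how large the price is:

```python
import numpy as np

toy_prices = np.array([5.0, 10.0, 25.0, 50.0])  # made-up prices in $ thousands
toy_logs = np.log(toy_prices)

# The absolute gaps differ by a factor of 5...
print(toy_prices[1] - toy_prices[0], toy_prices[3] - toy_prices[2])  # 5.0 25.0
# ...but both pairs are a doubling, so the log gaps are identical
print(np.isclose(toy_logs[1] - toy_logs[0], toy_logs[3] - toy_logs[2]))  # True
```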
Using log prices instead, our model has changed to:
$$ \log(\widehat{PRICE}) = \theta_0 + \theta_1 RM + \theta_2 NOX + \theta_3 DIS + \theta_4 CHAS + ... + \theta_{13} LSTAT $$
Challenge:
Use train_test_split() with the same random state as before to make the results comparable.
X_train, X_test, log_y_train, log_y_test =\
    train_test_split(X, log_prices, train_size=0.8, random_state=10)
regression_2 = LinearRegression()
regression_2.fit(X_train, log_y_train)
regression_2.intercept_
4.0599438717751966
coeffs_2 = pd.DataFrame(regression_2.coef_, index=X_train.columns,
columns=['Coef']).T
coeffs_2
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Coef | -0.01 | 0.00 | 0.00 | 0.08 | -0.70 | 0.07 | 0.00 | -0.05 | 0.01 | -0.00 | -0.03 | 0.00 | -0.03 |
rsquared_2 = regression_2.score(X_train, log_y_train)
print(f'New training data r-squared: {rsquared_2:.2}')
New training data r-squared: 0.79
Even better than before! From 0.75 to 0.79!
round(regression_2.score(X_test, log_y_test),2)
0.74
And with the test data, also better than before! From 0.67 to 0.74!
Challenge: Print out the coefficients of the new regression model.
Hint: Use a DataFrame to make the output look pretty.
coeffs_2
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Coef | -0.01 | 0.00 | 0.00 | 0.08 | -0.70 | 0.07 | 0.00 | -0.05 | 0.01 | -0.00 | -0.03 | 0.00 | -0.03 |
The coefficients still make sense. More crime (CRIM) negative, more rooms (RM) positive, more pollution (NOX) highly negative, poverty (LSTAT) negative.
Being next to the river impacts the PRICE positively too.
with pd.option_context('display.min_rows', 4):
display(data.PTRATIO)
0 15.30
1 17.80
...
504 21.00
505 21.00
Name: PTRATIO, Length: 506, dtype: float64
We have a negative coef for PTRATIO, meaning the more students per teacher, the lower the house prices.
Course:
So how can we interpret the coefficients? The key thing we look for is still the sign - being close to the river results in higher property prices because CHAS has a coefficient greater than zero. Therefore property prices are higher next to the river.
More students per teacher - a higher PTRATIO - is a clear negative. Smaller classroom sizes are indicative of higher-quality education, so we'd expect a negative coefficient for PTRATIO.
Challenge:
Plot the actual vs. predicted prices, as well as the residuals vs. predicted values, for both models. Use indigo as the colour for the original regression and navy for the regression using log prices.
predicted_log_values = regression_2.predict(X_train)
log_residuals = log_y_train - predicted_log_values
plt.figure(figsize=(6,4), dpi=120)
with sns.axes_style('ticks'):
g = sns.scatterplot(x=y_train, y=predicted_values,
c='indigo', alpha=0.6)
sns.lineplot(x=y_train, y=y_train, c='cyan')
g.set(title=f'Actual vs Predicted Prices: $y _i$ vs $\hat y_i$'\
f' (R-Squared {rsquare:.2f})',
xlabel=f'Actual prices in \$ thousands $y _i$',
ylabel=f'Predicted prices in \$ thousands $\hat y _i$')
plt.show()
plt.figure(figsize=(6,4), dpi=120)
with sns.axes_style('ticks'):
g = sns.scatterplot(x=log_y_train, y=predicted_log_values,
c='navy', alpha=0.6)
sns.lineplot(x=log_y_train, y=log_y_train, c='cyan')
g.set(title=f'Actual vs Predicted Log Prices: $y _i$ vs $\hat y_i$'\
f' (R-Squared {rsquared_2:.2f})',
xlabel=f'Actual Log Prices $y _i$',
ylabel=f'Predicted Log Prices $\hat y _i$')
plt.show()
plt.figure(figsize=(6,4), dpi=120)
with sns.axes_style('ticks'):
g = sns.scatterplot(x=predicted_values, y=residuals,
c='indigo', alpha=0.6)
g.set(title=f'Residuals vs Predicted Values',
xlabel='Predicted prices in \$ thousands $\hat y _i$',
ylabel='Residuals')
plt.show()
plt.figure(figsize=(6,4), dpi=120)
with sns.axes_style('ticks'):
g = sns.scatterplot(x=predicted_log_values, y=log_residuals,
c='navy', alpha=0.6)
g.set(title=f'Residuals vs Predicted Values for Log Prices',
xlabel='Predicted Log Prices $\hat y _i$',
ylabel='Residuals')
plt.show()
It's hard to see a difference here just by eye. The predicted values seem slightly closer to the cyan line, but eyeballing the charts is not terribly helpful in this case.
Challenge:
Calculate the mean and the skew for the residuals using log prices. Are the mean and skew closer to 0 for the regression using log prices?
print(log_residuals.mean(), '\n')
print(f"{log_residuals.mean():.20f}")
resid_mean_2 = round(log_residuals.mean(),2)
8.381634220561207e-16
0.00000000000000083816
The mean is even closer to zero than before. But both are essentially zero anyway.
print(log_residuals.skew())
resid_skew_2 = round(log_residuals.skew(), 2)
0.09299942594123323
The skew is now at 0.093 compared to 1.46 before, so definitely an improvement.
plt.figure(dpi=120)
with sns.axes_style('ticks'):
dis = sns.displot(x=residuals, bins=50,
kde=True, height=6,
color='indigo')
dis.set(xlim=(-25, 25),
title=f'Previous Model: Residuals Skew ({resid_skew}) Mean ({resid_mean})')
plt.show()
plt.figure(dpi=120)
with sns.axes_style('ticks'):
dis = sns.displot(x=log_residuals, bins=50,
kde=True, height=6,
color='navy')
dis.set(title=f'Log Price Model: Residuals Skew ({resid_skew_2}) Mean ({resid_mean_2})')
plt.show()
Yep, we can definitely see that the distribution of the residuals, with the log model, is more symmetrical.
Course: Our new regression residuals have a skew of 0.09 compared to a skew of 1.46. The mean is still around 0. From both a residuals perspective and an r-squared perspective we have improved our model with the data transformation.
The real test is how our model performs on data that it has not "seen" yet. This is where our X_test comes in.
Challenge
Compare the r-squared of the two models on the test dataset. Which model does better? Is the r-squared higher or lower than for the training dataset? Why?
print(regression.score(X_test, y_test))
print(regression_2.score(X_test, log_y_test))
0.6709339839115642
0.744692230626074
The log model does better, with an r-squared of 0.74 compared to 0.67.
However, both models do worse on the test data than on the training data. This is normal: the test set is data the model has never seen before.
Course: By definition, the model has not been optimised for the testing data. Therefore performance will be worse than on the training data. However, our r-squared still remains high, so we have built a useful model.
Our preferred model now has an equation that looks like this:
$$ \log(\widehat{PRICE}) = \theta_0 + \theta_1 RM + \theta_2 NOX + \theta_3 DIS + \theta_4 CHAS + ... + \theta_{13} LSTAT $$
The average property has the mean value for all its characteristics:
# Starting Point: Average Values in the Dataset
features = data.drop(['PRICE'], axis=1)
average_vals = features.mean().values
property_stats = pd.DataFrame(data=average_vals.reshape(1, len(features.columns)),
columns=features.columns)
property_stats
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.61 | 11.36 | 11.14 | 0.07 | 0.55 | 6.28 | 68.57 | 3.80 | 9.55 | 408.24 | 18.46 | 356.67 | 12.65 |
Challenge
Predict how much the average property is worth using the stats above. What is the log price estimate and what is the dollar estimate? You'll have to reverse the log transformation with np.exp() to find the dollar value.
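As a hint: np.exp() undoes np.log(), so converting a log-price estimate back to dollars is a single call. A sketch with a made-up log price of 3.0 (remember prices are in $ thousands):

```python
import numpy as np

log_estimate = 3.0  # made-up log-price estimate, not the model's output

# Undo the log transform, then scale from $ thousands to dollars
dollar_estimate = np.exp(log_estimate) * 1000

# Round-trip check: exp(log(x)) recovers x
print(np.isclose(np.exp(np.log(22.5)), 22.5))  # True
print(f"${dollar_estimate:,.0f}")  # $20,086
```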
regression_2.intercept_
4.0599438717751966
regression_2.coef_
array([-1.06717261e-02, 1.57929102e-03, 2.02989827e-03, 8.03305301e-02,
-7.04068057e-01, 7.34044072e-02, 7.63301755e-04, -4.76332789e-02,
1.45651350e-02, -6.44998303e-04, -3.47947628e-02, 5.15896157e-04,
-3.13900565e-02])
regression_2.coef_.ndim
1
property_stats.values
array([[3.61352356e+00, 1.13636364e+01, 1.11367787e+01, 6.91699605e-02,
5.54695059e-01, 6.28463439e+00, 6.85749012e+01, 3.79504269e+00,
9.54940711e+00, 4.08237154e+02, 1.84555336e+01, 3.56674032e+02,
1.26530632e+01]])
Testing how NumPy multiplies arrays elementwise:
np.array([1,2,3]) * np.array([1,4,1])
array([1, 8, 3])
np.array([1,2,3]) * np.array([1,4,1]).T
array([1, 8, 3])
Using np.dot() to compute the "Dot product of two arrays."
Specifically, if both a and b are 1-D arrays, it is inner product of vectors.
np.dot(np.array([1,2,3]), np.array([1,1,1]))
6
YES!
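The array experiments above can be condensed into a few asserted one-liners. A sketch (the names a, b, row are illustrative):

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([1, 4, 1])

# Transposing a 1-D array changes nothing: there is no second axis to
# swap, which is why both multiplications above gave the same result.
assert (a.T == a).all()

# np.dot on two 1-D arrays is the inner product:
# the elementwise product, summed up.
assert np.dot(a, b) == (a * b).sum() == 12

# With a 2-D row (shape (1, 3), like property_stats.values) and a 1-D
# coefficient vector, np.dot does a matrix-vector product and returns
# a 1-element array.
row = a.reshape(1, 3)
print(np.dot(row, b))  # [12]
```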
predic_log_price_avg = float(regression_2.intercept_ \
+ np.dot(property_stats.values, regression_2.coef_))
print(predic_log_price_avg)
3.030287230563395
predic_price_avg = np.exp(predic_log_price_avg)
print(f"The predicted price for the average property is:"\
f" ${predic_price_avg*1000:,.0f}")
print(f"The average house price from our data is ${data.PRICE.mean()*1000:,.0f}")
The predicted price for the average property is: $20,703
The average house price from our data is $22,533
Course Solution: I don't need the .dot() function to multiply arrays, I can just use the .predict() function from our LinearRegression object.
display(regression_2.predict(property_stats))
display(np.exp(regression_2.predict(property_stats)))
array([3.03028723])
array([20.70317832])
Course: A property with an average value for all the features has a value of $20,700.
Challenge
Keeping the average values for CRIM, RAD, INDUS and others, value a property with the following characteristics:
# Define Property Characteristics
next_to_river = True
nr_rooms = 8
students_per_classroom = 20
distance_to_town = 5
pollution = data.NOX.quantile(q=0.75) # high
amount_of_poverty = data.LSTAT.quantile(q=0.25) # low
# Solution:
property_stats_2 = property_stats.copy()  # copy, so we don't overwrite the averages
if next_to_river:
    property_stats_2.CHAS = 1
else:
    property_stats_2.CHAS = 0
property_stats_2.RM = nr_rooms
property_stats_2.PTRATIO = students_per_classroom
property_stats_2.DIS = distance_to_town
property_stats_2.NOX = pollution
property_stats_2.LSTAT = amount_of_poverty
property_stats_2
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.61 | 11.36 | 11.14 | 1 | 0.62 | 8 | 68.57 | 5 | 9.55 | 408.24 | 20 | 356.67 | 6.95 |
predic_price_spec = np.exp(regression_2.predict(property_stats_2)[0]) * 1000
print(f"The predicted price for a property with these specific "\
f"features is ${predic_price_spec:,.0f}")
The predicted price for a property with these specific features is $25,792
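The whole valuation workflow (set the features, predict the log price, reverse the log) could be wrapped in a small helper. A sketch with hypothetical names (estimate_dollar_value, log_model) and made-up data, not the course's solution:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def estimate_dollar_value(log_model, stats_row):
    """Given a fitted log-price model and a one-row DataFrame of
    feature values, return the dollar estimate."""
    log_price = log_model.predict(stats_row)[0]
    return np.exp(log_price) * 1000  # prices are in $1000's

# Tiny demo with made-up data so the sketch runs on its own.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(50, 2)), columns=['RM', 'LSTAT'])
log_y = 3 + 0.1 * X.RM - 0.05 * X.LSTAT
log_model = LinearRegression().fit(X, log_y)

avg_row = X.mean().to_frame().T  # one-row DataFrame of average values
print(f"${estimate_dollar_value(log_model, avg_row):,.0f}")
```

Tweaking one column of the row (e.g. `avg_row.RM = 8`) and calling the helper again mirrors the challenge above.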
Today we've learned:
How to quickly spot relationships in a dataset using Seaborn's .pairplot(). The resulting figure can be quite big: if the DataFrame has 5 columns/parameters, each is paired with every other, so we get a grid of 25 plots. There are lots of options for this plot, see the documentation. By default, a histogram of each distribution appears on the diagonal and a scatter plot for the pairings. But we can also have a regplot instead (with param kind='reg'), which is essentially a scatter with a regression line. We can pass arguments to each different type of plot, see the example.
How to look at some of the relationships/pairings in more detail with Seaborn's .jointplot(). Here too we can highly customize the plot with the kind= param and by passing parameters such as:
joint_kws={'scatter_kws': {'edgecolor': 'white', 'alpha': 0.5},
'line_kws': {'color': 'orange'},
'lowess': True}
from sklearn.model_selection import train_test_split
X = data.drop('PRICE', axis='columns')
y = data.PRICE
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=10)
regression = LinearRegression()
regression.fit(X_train, y_train)
regression.intercept_
regression.coef_
rsquare = regression.score(X_train, y_train)
coeffs = pd.DataFrame(regression.coef_, index=data.iloc[:, :13].columns, columns=['Coef']).T
How to calculate a model's predicted values (with .predict()) and the residuals, then analyse and look for patterns in those residuals, especially looking at the mean and the skewness.
predicted_values = regression.predict(X_train)
residuals = (y_train - predicted_values)
# Plot actual vs predicted values
sns.scatterplot(x=y_train, y=predicted_values)
sns.lineplot(x=y_train, y=y_train)
# Plot residuals vs predicted
sns.scatterplot(x=predicted_values, y=residuals)
# Mean and skew
resid_mean = round(residuals.mean(), 2)
resid_skew = round(residuals.skew(), 2)
# Checking distribution symmetry
sns.displot(x=residuals, bins=50, kde=True)
# Checking target distr. symmetry
tgt_skew = data.PRICE.skew()
sns.displot(x=data.PRICE, bins=50, kde=True)
# Checking again after data transform (e.g. log)
log_skew = np.log(data.PRICE).skew()
sns.displot(x=np.log(data.PRICE), bins=50, kde=True)
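The effect of the log transform on skew can be checked on made-up data. A sketch with a synthetic lognormal "price" sample (not the Boston data), where the log transform should pull a clearly right-skewed distribution back towards symmetry:

```python
import numpy as np
import pandas as pd

# Made-up right-skewed "prices": a lognormal sample, so taking the log
# recovers an (approximately) normal, near-symmetric distribution.
rng = np.random.default_rng(7)
prices = pd.Series(np.exp(rng.normal(3, 0.4, size=1000)))

print(f"Skew before log: {prices.skew():.2f}")          # clearly positive
print(f"Skew after log:  {np.log(prices).skew():.2f}")  # close to zero
```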
How to specify your own values for various features and use your model to make a prediction: .predict()
That it can be interesting to display some key values in the plot title (e.g. avg/mean, r-squared, skew)
How to change the display options of NumPy with .set_printoptions(), especially for the display of arrays: threshold and edgeitems. We can also change the display options temporarily like so:
with np.printoptions(threshold=500, edgeitems=10):
display(yes_no_river)
How to plot the .values of a Series on the y-axis and create our own list of labels on the x-axis, like so: fig = px.bar(x=['No', 'Yes'], y=river_access.values)
Today we've reviewed:
How to get the column names: data.columns
How to plot a distribution with Seaborn's .displot(), how to add a density distribution (KDE) to that histogram with the param kde=True, how to control the number of bins with bins=, and how to change the bins' border color with ec=
How to use the height= and aspect= parameters on some Seaborn figures, when we cannot resize with plt.figure(figsize=(width, height))